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1 Introduction 


If, in the mid 1980’s, one had asked the average statistician about the diffi¬ 
culties of using Bayesian Statistics, the most likely answer would have been 
“Well, there is this problem of selecting a prior distribution and then, even 
if one agrees on the prior, the whole Bayesian inference is simply impossi¬ 
ble to implement in practice!” The same question asked in the 21th Century 
does not produce the same reply, but rather a much less aggressive com¬ 
plaint about the lack of generic software (besides winBUGS), along with 
the renewed worry of subjectively selecting a prior! The last 20 years have 
indeed witnessed a tremendous change in the way Bayesian Statistics are 
perceived, both by mathematical statisticians and by applied statisticians 
and the impetus behind this change has been a prodigious leap-forward in 
the computational abilities. The availability of very powerful approximation 
methods has correlatively freed Bayesian modelling, in terms of both model 
scope and prior modelling. This opening has induced many more scientists 
from outside the statistics community to opt for a Bayesian perspective as 
they can now handle those tools on their own. As discussed below, a most 
successful illustration of this gained freedom can be seen in Bayesian model 
choice, which was only emerging at the beginning of the MCMC era, for lack 
of appropriate computational tools. 

In this chapter, we will first present the most standard computational chal¬ 
lenges met in Bayesian Statistics (Section [2]), and then relate these problems 
with computational solutions. Of course, this chapter is only a terse intro¬ 
duction to the problems and solutions related to Bayesian computations. For 
more complete references, see Robert and Casella (2004), Marin and Robert 


(2007a), Robert and Casella (2004) and Liu (20011, among others. We also 


restrain from providing an introduction to Bayesian Statistics per se and for 


comprehensive coverage, address the reader to Marin and Robert (2007a) and 


Robert (2007), (again) among others. 
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2 Bayesian computational challenges 

Bayesian Statistics being a complete inferential methodology, its scope en¬ 
compasses the whole range of standard statistician inference (and design), 
from point estimation to testing, to model selection, and to non-parametrics. 
In principle, once a prior distribution has been chosen on the proper space, 
the whole inferential machinery is set and the computation of estimators is 
usually automatically derived from this setup. Obviously, the practical or nu¬ 
merical derivation of these procedures may be exceedingly difficult or even 
impossible, as we will see in a few selected examples. Before, we proceed with 
an incomplete typology of the categories and difficulties met by Bayesian in¬ 
ference. First, let us point out that computational difficulties may originate 
from one or several of the following items: 

(i) use of a complex parameter space, as for instance in constrained param¬ 
eter sets like those resulting from imposing stationarity constraints in 
dynamic models; 

(ii) use of a complex sampling model with an intractable likelihood, as for 
instance in missing data and graphical models; 

(iii) use of a huge dataset; 

(iv) use of a complex prior distribution (which may be the posterior distribu¬ 
tion associated with an earlier sample); 

(v) use of a complex inferential procedure. 


2.1 Bayesian point estimation 

In a formalised representation of Bayesian inference, the statistician is given 
(or selects) a triplet 

— a sampling distribution, f(x\0), usually associated with an observation (or 
a sample) x\ 

— a prior distribution it (6), defined on the parameter space (9; 

— a loss function L (6,d) that compares the decisions (or estimations) d for 
the true value 9 of the parameter. 

Using (/, 7 r, L) and an observation x, the Bayesian inference is always given 
as the solution to the minimisation programme 

min [ L(0,d) f(x\9)ir(9)d9, 
d Jo 

equivalent to the minimisation programme 


min [ L(9,d)n(0\x)d0. 

d Jo 

The corresponding procedure is thus associated, for every x, to the solution 
of the above programme (see, e.g. Robert 2007 Chap. 2). 
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There are therefore two levels of computational difficulties with this res¬ 
olution: first the above integral must be computed. Second, it must be min¬ 
imised in d. For the most standard losses, like the traditional squared error 
loss, 

L(0,d) = \9 — d\ 2 , 

the solution to the minimisation problem is universally 1 known. For instance, 
for the squared error loss, it is the posterior mean, 


f 9tt(9\x)(19 = f 9 f(x\9) w(9) dd / f f(x\9)n(9)d9, 

J0 JO / JO 


which still requires the computation of both integrals and thus whose com¬ 
plexity depends on the complexity of 0, f(x\0), and ^ r(0). 


Example 1. For a normal distribution A f(0, 1), the use of a so-called conjugate 
prior (see, e.g. 


Robert 2007 Chap. 3) 

0 ~N{n,e) 


leads to a closed form expression for the mean, since 


9f{x\0)n{6)d9 / f)x\9)^9)A9 = 


/e 


io 


[ 0exp ^ {— 9 2 (1 + e 2 ) + 2 9(x + e 2 /u)} d 9 

J r 2 

/ f exp i {-9 2 [l + e -2 ) + 29{x + d9 

/ Js. 2 


x + e 2 /x 
1 + e~ 2 


On the other hand, if we use instead a more involved prior distribution like a 
poly-< distribution (Bauwensand Richard 1985), 


k 

7r(0) = ]^[ [cq + {9 - Pi)' 2 ] a, v > 0 


the above integrals cannot be computed in closed form anymore. This is not 
a toy example in that the problem may occur after a sequence of k Student's 
t observations, or with a sequence of normal observations whose variance is 
unknown. 

The above example is one-dimensional, but, obviously, bigger challenges 
await the Bayesian statistician when she wants to tackle high-dimensional 
problems. 

Example 2. In a generalised linear model, a conditional distribution of y £ R 
given x £ M p is defined via a density from an exponential family 


y\x - exp {y ■ 9(x) - i/)(9(x :))} 
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whose natural parameter 9(x) depends on the conditioning variable x, 

6{x) = g(p T x ), f3 GW 

that is, linearly modulo the transform g. Obviously, in practical applications like 
Econometrics, p can be quite large. Inference on /3 (which is the true parameter of 
the model) proceeds through the posterior distribution (where x = (x \,..., x t) 
and y = ( yi ,...,y T )) 


tt(/3|x, y) oc exp {y t ■ 6{x t ) - ip{9(x t ))} tt(/3) 

t= 1 

( T T \ 

= ex p) yt ■ - X! \ > 

L i=l i= 1 J 

which rarely is available in closed form. In addition, in some cases ip may be 
costly simply to compute and in others T may be large or even very large. Take 
for instance the case of the dataset processed by Abowd et al. (1999), which 


covers twenty years of employment histories for over a million workers, with x 
including indicator variables for over one hundred thousand companies. 

Complexity starts sharply increasing if we introduce in addition random effects 
to the model, that is writing 6(x) as g(/3 T x + e(x)), where e(a;) is a random 
perturbation indexed by x. For instance, in the above employment dataset, it may 
correspond to a worker effect or to a company effect. The difficulty is that those 
random variables can very rarely be integrated out into a closed-form marginal 
distribution. They must therefore be included within the model parameter, which 
then increases its dimension severalfold. 


A related, although conceptually different, inferential issue concentrates 
upon prediction , that is, the approximation of a distribution related with the 
parameter of interest, say g(y\6), based on the observation of x ~ f(x\9). 
The predictive distribution is then defined as 


n(y\x)= [ g(y\9)TT(9\x)d9. 

JO 

A first difference with the standard point estimation perspective is obviously 
that the parameter 9 vanishes through the integration. A second and more 
profound difference is that this parameter is not necessarily well-defined any¬ 
more. As will become clearer in a following Section, this is a paramount 
feature in setups where the model is not well-defined and where the statis¬ 
tician hesitates between several (or even an infinity of) models. It is also a 
case where the standard notion of identifiability is irrelevant, which paradox¬ 
ically is a ’’plus” from the computational point of view, as seen below in, e.g., 
Example [13] 
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Example 3. Recall that an AR(p) model is given as the auto-regressive repre¬ 
sentation of a time series, 


p 

X t = '^2 ®i x t-i + . 

i—1 

It is often the case that the order p of the AR model is not fixed a priori , but 
has to be determined from the data itself. Several models are then competing 
for the “best" fit of the data, but if the prediction of the next value x t+ i is the 
most important part of the inference, the order p chosen for the best fit is not 
really relevant. Therefore, all models can be considered in parallel and aggregated 
through the predictive distribution 

Tr(x t +i\x t , ...,xi)cx J f(x t + i\x t ,... ,x t -p + i)n(0,p\x t ,. ■.,Xi)6pd0 , 

which thus amounts to integrating over the parameters of all models, simultane¬ 
ously: 


OO „ 

^ / f(xt+i\x t ,".>x t - p + 1 )'ir(0\p,x u ...,x 1 )dOn(p\x u ...,x 1 ). 
p= o*' 

Note the multiple layers of complexity in this case: 

(i) if the prior distribution on p has an infinite support, the integral simultane¬ 
ously considers an infinity of models, with parameters of unbounded dimen¬ 
sions; 

(ii) the parameter 0 varies from model AR(p) to model AR(p + 1), so must be 

evaluated differently from one model to another. In particular, if the stationar- 
ity constraint usually imposed in these models is taken into account, the con¬ 
straint on 0 P ) varies 2 between model AR(p) and model AR(p+\)\ 

(iii) prediction is usually used sequentially: every tick/second/hour/day, the next 
value is predicted based on the past values Xt, ■ ■. ,X\). Therefore when t 
moves to t + 1 , the entire posterior distribution n(0,p\xt, ■ ■ ■ , Xi) must be 
re-evaluated again, possibly with a very tight time constraint as for instance 
in financial or radar tracking applications. 

We will discuss this important problem in deeper details after the testing 
section, as part of the model selection problematic. 

2.2 Testing hypotheses 

A domain where both the philosophy and the implementation of Bayesian 
inference are at complete odds with the classical approach is the area of 
testing of hypotheses. At a primary level, this is obvious when opposing the 
Bayesian evaluation of an hypothesis H 0 : 0 £ 0 O 



6 


Pr 7r (0 G &o\x) 


with a Neyman- Pearson p- value 


sup Pr*(T(X) > T(x)) 
0e&o 


where T is an appropriate statistic, with observed value T(x). The first quan¬ 
tity involves an integral over the parameter space, while the second provides 
an evaluation over the observational space. At a secondary level, the two an¬ 
swers may also strongly disagree even when the number of observations goes 
to infinity, although there exist cases and priors for which they agree to the 
order 0(n -1 ) or even O(n -3 / 2 ). (See Robert 2007, Section 3.5.5 and Chapter 
5, for more details.) 

From a computational point of view, most Bayesian evaluations of the 
relevance of an hypothesis—also called the evidence —given a sample x involve 
marginal distributions 

[ f(x\6i)TTi(6i)d0i ( 1 ) 


where 6>j and 7r,; denote the parameter space and the corresponding prior, 
respectively, under hypothesis Hi (i = 0,1). For instance, the Bayes factor 
is defined as the ratio of the posterior probabilities of the null and the alter¬ 
native hypotheses over the ratio of the prior probabilities of the null and the 
alternative hypotheses, i.e., 


_ Pr(6> € 6> 0 | x) / 7i \e € 6> 0 ) 
01 () Pr(6> &Oi\x) / tt(6 G Of)' 


This quantity is instrumental in the computation of the posterior probability 


Pr(0 G 6> 0 | a;) 


1 

1 + BUx) 


under equal prior probabilities for both 0q and Q \. It is also the central tool 
in practical (as opposed to decisional) Bayesian testing (Jeffreys 196l| and 
can be seen as the Bayesian equivalent of the likelihood ratio. 

The first ratio in Bq X (x) is then the ratio of integrals of the form ([T]) and 
it is rather common to face difficulties in the computation of both integrals. 3 


Example 4 (Continuation of Example [ 2 ]). In the case of the generalised 
linear model, a standard testing situation is to decide whether or not a factor, 
X\ say, is influential on the dependent variable y. This is often translated as 
testing whether or not the corresponding component of /?, /3i, is equal to 0, 
i.e. @0 = {/3;/3i = 0}. if we denote by /3_i the other components of /?, the 
Bayes factor for this hypothesis will be 
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1 1 = l t— l J ' 



when 7r_i is the prior constructed for the null hypothesis and when the prior 
weights of Hq and of the alternative are both equal to 1/2. Obviously, besides 
the normal conjugate case, both integrals cannot be computed in a closed form. 

In a related manner, confidence regions are also mostly intractable, being 
defined through the solution to an implicit equation. Indeed, the Bayesian 
confidence region for a parameter 9 is defined as the highest posterior region , 


{ 0 ; k(0\x) > k(x)} 

where k(x) is determined by the coverage constraint 
Vv v (tt(0\x) > k{x)\x) = a , 


( 2 ) 


a being the confidence level. While the normalising constant is not necessary 
to construct a confidence region, the resolution of the implicit equation <[2j) is 
rarely straightforward! However, simulation-based equivalents generally pro¬ 
duce acceptable approximations in a straightforward manner when both the 
prior and the likelihood can be numerically computed. 

Example 5. Consider a binomial observation x ~ B(n , 9) with a conjugate prior 
distribution, 9 ~ Be( 71 , 72 )- In this case, the posterior distribution is available 
in closed form, 

9\x ~ Be (71 + x, 72 + n - x). 

However, the determination of the 9's such that 


071+s-l^ _ e y 2 +n-x- 1 > 


with 


pr - oy2+n-x-i > k ( x }\ x } = a 


is not possible analytically. It actually implies two levels of numerical difficulties: 

1. find the solution(s) to 0 7l+a:-1 (l — 9) 1 2+n-x-i _ p, 

2. find the k corresponding to the right coverage, 

and each value of k examined in step 2 . requires a new resolution of step 1 . 
However, k can also be interpreted as the (1 —cr) quantile of the random variable 
071 +*—i(i _ 0 ) 72 +n-x-i i hence derjved f rom a | arge sa mp| e of 9's in a Monte 
Carlo perspective. 
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The setting is usually much more complex when 9 is a multidimensional 
parameter, because the interest is usually in getting marginal confidence sets. 
Example [2] is an illustration of this setting: deriving a confidence region on 
one component, j3\ say, first involves computing the marginal posterior dis¬ 
tribution of this component. As in Example |4j the integral 

x t )| 7r_i(/3_i)d/3_i, 

which is proportional to 7t(/3i|x), is most often intractable. Fortunately, the 
simulation approximations mentioned above are also available to bypass this 
integral computation. 


/ ex P j E V* ' 9{P T Xt) - E 

• 7 ® p - 1 U=i t= l 




2.3 Model choice 

Although they are essentially identical from a conceptual viewpoint, we do 
distinguish here between model choice and testing, partly because the former 
leads to further computational difficulties, and partly because it encompasses 
a larger scope of inferential goals than mere testing. Note first that model 
choice has been the subject of considerable effort in the past decades, and has 
seen many advances, including the coverage of problems of higher complexity 
and the introduction of new concepts. We stress that such advances mostly 
owe to the introduction of new computational methods. 

As discussed in further details in Robert (2007, Chapter 7), the inferential 
action related with model choice does take place on a wider scale than simple 
testing: it covers and compares models, rather than parameters, which makes 
the sampling distribution f(x) “more unknown” than simply depending on 
an undetermined parameter. In some respect, it is thus closer to estimation 
than to regular testing. In any case, it requires a more precise evaluation of 
the consequences of choosing the “wrong” model or, equivalently of deciding 
which model is the most appropriate for the data at hand. It is thus both 
broader and less definitive than deciding whether Hq : 9\ = 0 is true. At 
last, the larger inferential scope mentioned in the first point means that we 
are leaving for a while the well-charted domain of solid parametric models. 

From a computational point of view, model choice involves more complex 
structures that, almost systematically, require advanced tools, like simulation 
methods which can handle collections of parameter spaces (also called spaces 
of varying dimensions), specially designed for model comparison. 

Example 6. A mixture of distributions is the representation of a distribution 
(density) as the weighted sum of standard distributions (densities). For instance, 
a mixture of Poisson distributions, denoted as 

k 

2=1 
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has the following density: 



This representation of distributions is multi-faceted and can be used in popula¬ 
tions with known heterogeneities (in which case a component of the mixture cor¬ 
responds to an homogeneous part of the population) as well as a non-parametric 
modelling of unknown populations. This means that, in some cases, k is known 
and, in others, it is both unknown and part of the inferential problem. 

First, consider the setting where several (parametric) models are in com¬ 
petition, 


Wti : x ~ /i(a;|0i), e O i} i£l 


the index set I being possibly infinite. From a Bayesian point of view, a prior 
distribution must be constructed for each model 9 Jli as if it were the only 
and true model under consideration since, in most perspectives except model 
averaging, only one of these models will be selected and used as the only and 
true model. The parameter space associated with the above set of models can 
be written as 



(3) 


the model indicator p £ I being now part of the parameters. So, if the 
modeller allocates probabilities pi to the indicator values, that is, to the 
models 931,; (i £ I), and if she then defines priors 7r,(0j) on the parameter 
subspaces (9,;, things fold over by virtue of Bayes’s theorem, since one can 
compute 



While a common solution based on this prior modelling is simply to take the 
(marginal) MAP estimator of p, that is, to determine the model with the 
largest p(DJli\x), or even to use directly the average 



as a predictive density in y in model averaging, a deeper-decision theoretic 
evaluation is often necessary. 

Example 7 (Continuation of Example [3]). in the setting of the AR{p) 
models, when the order p of the dependence is unknown, model averaging as 
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presented in Example [3] is not always a relevant solution when the statistician 
wants to estimate this order p for different purposes. Estimation is then a more 
appropriate perspective than testing, even though care must be taken because 
of the discrete nature of p. (For instance, the posterior expectation of p is not 
an appropriate estimator!) 

As stressed earlier in this Section, the computation of predictive densities, 
marginals, Bayes factors, and other quantities related to the model choice 
procedures is generally very involved, with specificities that call for tailor- 
made solutions: 

- The computation of integrals is increased by a factor corresponding to the 
number of models under consideration. 

Some parameter spaces are infinite-dimensional, as in non-parametric set¬ 
tings and that may cause measure-theoretic complications. 

- The computation of posterior or predictive quantities involves integration 
over different parameter spaces and thus increases the computational bur¬ 
den, since there is no time savings from one subspace to the next. 

- In some settings, the size of the collection of models is very large or even 
infinite and some models cannot be explored. For instance, in Example 
[4j the collection of all submodels is of size 2 P and some pruning method 
must be found in variable selection to avoid exploring the whole tree of all 
submodels. 


3 Monte Carlo Methods 

The natural approach to these computational problems is to use computer 
simulation and Monte Carlo techniques, rather than numerical methods, sim¬ 
ply because there is much more to gain from exploiting the probabilistic prop¬ 
erties of the integrands rather than their analytical properties. In addition, 
the dimension of most problems considered in current Bayesian Statistics 
is such that very involved numerical methods should be used to provide a 
satisfactory approximation in such integration or optimisation problems. In¬ 
deed, down-the-shelf numerical methods cannot handle integrals in moderate 
dimensions and more advanced numerical integration methods require ana¬ 
lytical studies on the distribution of interest. 

3.1 Preamble: Monte Carlo importance sampling 

Given the statistical nature of the problem, the approximation of an integral 


like 
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should indeed take advantage of the special nature of 3, namely, the fact that 
7 T is a probability density 4 or, instead, that f(x\6)n(9) is proportional to a 
density. As detailed in Chapter II.2 this volume, or in Robert and Casella 
(2004, Chapter 3) and Robert and Casella (2010), the Monte Carlo method 
was introduced by Metropolis and Ulam (1949) for this purpose. For instance, 
if it is possible to generate (via a computer) random variables 9 1,..., 9 m from 
7 r(0), the average 


— J2h(6i)f(x\0i) 
m z ' 


converges (almost surely) to 3 when m goes to +oo, according to the Law 
of Large Numbers. Obviously, if an i.i.d. sample of 9i s from the posterior 
distribution Tt(0\x) can be produced, the average 


1 m 

-EW (4) 

m z -—' 


converges to 


E n [h(6)\x] 


f 0 h(6)f(x\eM9)de 

I 0 f(x\9)TT(9)de 


and it usually is more interesting to use this approximation, rather than 


J2h{0i)f(x\6i) / J2f(x\0i) 


when the 0i s are generated from 7r(0), especially when 7r(0) is flat compared 
with ■n(9\x). 

In addition, if the posterior variance var(/i(0)|a;) is finite, the Central Limit 
Theorem applies to the empirical average 0- which is then asymptotically 
normal with variance w&x(h(9)\x)/ to. Confidence regions can then be built 
from this normal approximation and, most importantly, the magnitude of 
the error remains of order 1 /y/rn, whatever the dimension of the problem, 


in opposition with numerical methods. 5 (See also Robert and Casella 


2004 


2009, Chapter 4, for more details on the convergence assessment based on 


the CLT.) 

The Monte Carlo method actually applies in a much wider generality than 
the above simulation from 7r. For instance, because 3 can be represented in 
an infinity of ways as an expectation, there is no need to simulate from the 
distributions 7 t(-|:e) or 7r to get a good approximation of 3. Indeed, if g is a 
probability density with supp(g) including the support of \h(9)\f(x\9)n(9), 
the integral 3 can also be represented as an expectation against g , namely 


f H0)f(x\0M0) 

J g(e) 


9(0) dd. 
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This representation leads to the Monte Carlo method with importance func¬ 
tion g: generate according to g and approximate 3 through 


1 

TO 


i= 1 


with the weights oj(9i) = f{x\6i)'K{6 i )/g(6i). Again, by the Law of Large 
Numbers, this approximation almost surely converges to 3. And this estima¬ 
tor is unbiased. In addition, an approximation to ¥?[h{0)\x] is given by 

since the numerator and denominator converge to 


Z?=M0i) 


h(0)f(x\0)n(9)dO and 


f(x\0)n(0)d0, 


respectively, if supp(g) includes supp(/(a;|-)7r). Notice that the ratio ([5]) does 
not depend on the normalising constants in either h(0), f(x\0) or ir(9). The 
approximation <§ can therefore be used in settings when some of these nor¬ 
malising constants are unknown. Notice also that the same sample of 9f s 
can be used for the approximation of both the numerator and denominator 
integrals: even though using an estimator in the denominator creates a bias, 
<[5j) does converge to E 7r [/i(0)|a;]. 

While this convergence is guaranteed for all densities g with wide enough 
support, the choice of the importance function is crucial. First, simulation 
from g must be easily implemented. Moreover, the function g{9) must be close 
enough to the function h(0)Tr(0\x), in order to reduce the variability of ([5| 
as much as possible; otherwise, most of the weights ui(0i) will be quite small 
and a few will be overly influential. In fact, if E h [h 2 (0)ui 2 (0)\ is not finite 
the variance of the estimator |5]) is infinite (see |Robert and Casella, 2004 


Chapter 3). Obviously, the dependence on g of the importance function h can 
be avoided by proposing generic choices such as the posterior distribution 
tt(9\x). 


3.2 First illustrations 

In either point estimation or simple testing situations, the computational 
problem is often expressed as a ratio of integrals. Let us start with a toy 
example to set up the way Monte Carlo methods proceed and highlight the 
difficulties of applying a generic approach to the problem. 

Example 8. Consider a f-distribution T(v,9, 1) sample (xi,...,x n ) with v 
known. Assume in addition a flat prior 7r(0) = 1 as in a non-informative en¬ 
vironment. While the posterior distribution on 9 can be easily plotted, up to a 
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Fig. 1 . Posterior density of 9 in the setting of Example |8] for n = 10, with a 
simulated sample from T(3, 0,1). 


normalising constant (Figu re[T[), because we are in dimension 1, direct simulation 
and computation from this posterior is impossible. 

If the inferential problem is to decide about the value of 9, the posterior 
expectation is 

r. n 

E^ln, ...,x n ) = J 9 n [v + ( Xi - 9f] - ( " +1) / 2 d 9 

/ / iih + (x,-«f ]- < - +,)/2 ie. 

2=1 

This ratio of integrals is not directly computable. Since (u + (xi — 9) 2 )~^ +1 ^ 2 is 
proportional to a i-distribution T{y , Xi, 1) density, a solution to the approximation 
of the integrals is to use one of the % s to “be" the density in both integrals. 
For instance, if we generate 9\,...,Q m from the T(y,x i,l) distribution, the 
equivalent of ([5]) is 

m n 

c=e h n +(*i - erf] ~ (v+i)/2 (6) 

j =1 i =2 

m n 

j= 1 *=2 

since the first term in the product has been “used” for the simulation and the 
normalisation constants have vanished in the ratio. Figure [2] is an illustration of 
the speed of convergence of this estimator to the true posterior expectation: it 
provides the evolution of 6 ^ as a function of m both on average and on range 
(provided by repeated simulations of 6 ? n ). As can be seen from the graph, the 
average is almost constant from the start, as it should, because of unbiasedness, 
while the range decreases very slowly, as it should, because of extreme value 
theory. The graph provides in addition the 90% empirical confidence interval 
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built on these simulations. 6 The empirical confidence intervals are decreasing in 
1 /yfn, as expected from the theory. (This is further established by regressing the 
log-lengths of the confidence intervals on log(n), with slope equal to —0.5, as 
shown by Figure [3]) 



Fig. 2. Evolution of a sequence of 500 estimators |6| over 1,000 iterations: range 
(in gray), .05 and .95 quantiles, and average, obtained on the same sample as in 
Figure [I] when simulating from the t distribution with location x\. 



Fig. 3. Regression of the log-ranges (left) and the log-lengths of the confidence 
intervals (right) on log(n), for the output in Figure[2] 


Now, there is a clear arbitrariness in the choice of x\ in the sample (aq,..., 
x n ) for the proposal T(y, x\ , 1). While any of the aq's has the same theoretical 
validity to "represent" the integral and the integrating density, the choice of aq's 
closer to the posterior mode (the true value of 9 is 0) induces less variability 
in the estimates, as shown by a further simulation experiment through Figure 
[4] It is fairly clear from this comparison that the choice of extremal values like 
X(i) = —3.21 and even more £( 10 ) = 1-72 is detrimental to the quality of 
the approximation, compared with the median X( 5 ) = —0.86. The range of the 
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estimators is much wider for both extremes, but the influence of this choice is 
also visible for the average which does not converge so quickly.' 





r ^ 



Fig. 4. Repetition of the experiment described in Figure [ 2 ] for three different 
choices of xt: min xt, x ( 5 ) and max a;; (from left to right). 


This example thus shows that Monte Carlo methods, while widely avail¬ 
able, may easily face inefficiency problems when the simulated values are not 
sufficiently attuned to the distribution of interest. It also demonstrates that, 
fundamentally, there is no difference between importance sampling and regu¬ 
lar Monte Carlo, in that the integral 3 can naturally be represented in many 
ways. 

Although we do not wish to spend too much space on this issue, let us 
note that the choice of the importance function gets paramount when the 
support of the function of interest is not the whole space. For instance, a tail 
probability, associated with h(9) = I g>g 0 say, should be estimated with an 
importance function whose support is [<?o, 00 ). (See 
Chapter 3, for details.) 

Example 9 (Continuation of Example [8]) . If, instead, we wish to consider 
the probability that 9 > 0, using the t-distribution T(v,Xi, 1) is not a good idea 
because negative values of 9 are somehow simulated “for nothing”. A better 
proposal (in terms of variance) is to use the “folded” t-distribution T(y,Xi, 1), 
with density proportional to 

y’i(u) — \y + \Xi — 9) J + \y + \Xi + 9) J , 

on R + , which can be simulated by taking the absolute value of a T{y, Xi, 1) rv. 
All simulated values are then positive and the estimator of the probability is 

m 

Pm = En [" + - \°i\)T {V+1)/2 /M\o A) (7) 

i=l ijik 

m 

j=l ijtk 


Robert and Casella 2004 
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where the dj’s are iid T(is,Xk, 1). Note that this is a very special occurrence 
where the same sample can be used in both the numerator and the denominator. 
In fact, in most cases, two different samples have to be used, if only because the 
support of the importance distribution for the numerator is not the whole space, 
unless, of course, all normalising constants are known. Figure[5]reproduces earlier 
figures for this problem, when using x( 5 ) as the parameter of the t, distribution. 



Fig. 5. Evolution of a sequence of 100 estimators |7j over 1,000 iterations (same 
legend as Figure [1|). 


The above example is one-dimensional (in the parameter) and the prob¬ 
lems exhibited there can be found severalfold in multidimensional settings. 
Indeed, while Monte Carlo methods do not suffer from the “curse of dimen¬ 
sion” in the sense that the error of the corresponding estimators is always 
decreasing in 1/ 1 Jn, notwithstanding the dimension, it gets increasingly dif¬ 
ficult to come up with satisfactory importance sampling distributions as the 
dimension gets higher and higher. As we will see in Section [5j the intuition 
built on MCMC methods has to be exploited to derive satisfactory impor¬ 
tance functions. 

Example 10 (Continuation of Example [ 2 ]). A particular case of gener¬ 
alised linear model is the probit model, 

Pr e (Y = l]x) = 1 - Pr g (Y = 0|x) = <P(x T d) 9 e W °, 

where denotes the normal Af(0, 1) cdf. Under a flat prior 7 r(9) = 1, for a sample 
(xi, yi ),..., ( x n , y n ), the corresponding posterior distribution is proportional to 

n 

Y[<P(xJe)y^(-xJe) 1 -^. ( 8 ) 

i = 1 

Direct simulation from this distribution is obviously impossible since the compu¬ 
tation of <P(z) is a difficulty in itself. If we pick an importance function for this 
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problem, the adequation with the posterior distribution will need to be better and 
better as the dimension p increases. Otherwise, the repartition of the weights will 
get increasingly asymmetric: very few weights will be different from 0. 

Figure [6] illustrates this degeneracy of the importance sampling approach 
as the dimension increases. We simulate parameters /?’s and datasets ( Xi,yi ) 
(i = 1, ...,245) for dimensions p ranging from 1 to 10, then represented the 
histograms of the largest weight for p = 1,2,5,10. The xfs were simulated 
from a Af p (0,I p ) distribution, while the importance sampling distribution was a 
I p /p) distribution. 



Fig. 6. Comparison of the distribution of the largest importance weight based 
upon 150 replications of an importance sampling experiment with 245 observations 
and dimensions p = 1, 2, 5,10. 


3.3 Approximations of the Bayes factor 


As explained in Sections |2.2| and 2.3 the first computational difficulty as¬ 
sociated with Bayesian testing is the derivation of the Bayes factor, of the 
form 


S 7T _ 

12 _ 


te i 


/i(x|6»i)7ri(0i)d6»i 

/2(x|6»2)7r 2 (6 , 2)d6»2 


m\{x) 
m 2 ( x) ’ 


02 
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where, for simplicity’s sake, we have adopted a model choice perspective (that 
is, 81 and 6 2 may live in completely different spaces). 

Specific Monte Carlo methods for the estimation of ratios of normalising 
constants, or, equivalently, of Bayes factors, have been developed in the past 
five years. See Chen et al. (2000, Chapter 5) for a complete exposition, as 
well as Chopin and Robert| ( 2010| and Marin and Robert (2007b) for recent 
reassessments. In particular, the importance sampling technique is rather 
well-adapted to the computation of those Bayes factors: Given a importance 
distribution, with density proportional to g, and a sample 9 ^,..., 9 ^ sim¬ 
ulated from g , the marginal density for model ©b, mi(x ), is approximated 

by 

y" /V 7ri(g(0 ) 

h { am / ^ 9 m 5 


iO) = 


where the denominator takes care of the (possibly) missing normalising con¬ 
stants. (Notice that, if g is a density, the expectation of n(6^)/g(0^) is 1 
and the denominator should be replaced by T to decrease the variance of the 
estimator of rrii(x).) 

A compelling incentive, among others, for using importance sampling in 
the setting of model choice is that the sample (flW,..., 9^) can be recycled 
for all models ©b sharing the same parameters (in the sense that the models 
©b are parameterised in the same way, e.g. by their first moments). 


Example 11 (Continuation of Example [4]). In the case the /3’s are simu¬ 
lated from a product instrumental distribution 


p 

g{P) = - 

i—1 

the sample of f3’s produced for the general model of Example [2] ©R say, can 
also be used for the restricted model, ©R, where /R = 0, simply by deleting the 
first component and keeping the following components, with the corresponding 
importance density being 

9-i(P) = II 90i)- 

i =2 


Once the /3’s have been simulated, 8 the Bayes factor BR can be approximated 

by rhi{x)/m 2 {x). 

However, the variance of rh(x) may be infinite, depending on the choice 
of g. A possible choice is g{9) = n(9), with wider tails than n(9\x), but this is 
often inefficient if the data is informative because the prior and the posterior 
distributions will be quite different and most of the simulated values 8 ^> fall 
outside the modal region of the likelihood. (This is of course impossible when 
7 T is improper.) For the choice g{9) = f(x\9)Tr(9), 
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*(*) = l tY, 




(9) 


is the harmonic mean of the likelihoods, but the corresponding variance is 
infinite when the likelihood has thinner tails than the prior (which is often 
the case). Having an infinite variance means that the numerical value pro¬ 
duced by the estimate cannot be trusted at all, as it may be away from the 


true marginal by several orders of magnitude. As discussed in Chopin and 


Robert (20101 and Marin and Robert (2007b), it is actually possible to use 
a generalised harmonic mean estimate 






T “ 7r(0 (t ))/(x|0 (t) ) ’ 

when ip is a probability density with tails that are thinner than the posterior 
tails. For instance, a density with support an approximate HPD region is 
quite appropriate. 

Explicitly oriented towards the computation of ratios of normalising con¬ 
stants, bridge sampling was introduced in Meng and Wong| (1996): if both 
models fUR and SEJR cover the same parameter space 0, if tti(0\x) = Cini(9\x) 
and TT2(0\x) = c 2 Tt 2 (0\x)^ where C\ and C 2 are unknown normalising constants, 
then the equality 

c 2 _ E 71 ' 2 [7Ti(0|:r) h(6)] 

Ci E 7ri [7T 2 (0| x)h(0)\ 

holds for any bridge function h(9) such that both expectations are finite. The 
bridge sampling estimator is then 

1 


n 1 

j^S _ i—1 

~ 1 n 2 


n, z —* 


— 7>i {6 2 j\x) h{e 2i ) 

n *t[ 


where the Ojfs are simulated from -Kj{0\x) (j = 1,2, i = 1 ,..., nj). 


For instance, if 


h(8) = 1/ [ni(6\x)TT2(0u\x)\ , 


(1996) have derived an (asymptotically) optimal bridge function 

ni + n 2 


then Bf 2 is a ratio of harmonic means, generalising (|9j). Meng and Wong 
1 (asymj 

h*{8) = 


nini(d\x) +n 2 n 2 (0\x) ' 


This choice is not of direct use, since the normalising constants of 7Ti(0|a;) 
and 'k 2 (0 \x) are unknown (otherwise, we should not need to resort to such 
techniques!). Nonetheless, it shows that a good bridge function should cover 
the support of both posteriors, with equal weights if ni = n 2 - 
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Example 12 (Continuation of Example [ 2 ]). For generalised linear mod¬ 
els, the mean (conditionally on the covariates) satisfies 

E[;#] = VV>(0) = , 


where is the link function. The choice of the link function usually is quite 
open. For instance, when the y's take values in {0, 1}, three common choices of 
are ( McCullagh and Nelder| 1989) 


it) = exp(f)/(l+exp(f)), & 2 (t) = ^(t), and <P 3 (t) = l-exp(-exp(f)), 


corresponding to the logit, probit and log-log link functions (where <P denotes 
the c.d.f. of the distribution). If the prior distribution n on the /3's is 

a normal AT p (1;,t 2 I p ), and if the bridge function is h(fd) = l/ir(/3), the bridge 
sampling estimate is then (1 < * < j < 3) 



n 

n ^ZfAPit\x) 

L t=1 


where the flu are generated from 7Tj(/3*|:r) oc /i(/?i|ir)7r(/3j), that is, from the 
true posteriors for each link function. 


The drawback in using bridge sampling is that the extension of the method 
to cases where the two models and 9Jt 2 do not cover the same parame¬ 
ter space, for instance because the two corresponding parameter spaces are 
of different dimensions, requires the introduction of a pseudo-posterior dis¬ 
tribution (Chen et al. 2000 Marin and Robert 2007b I) that complete the 


smallest parameter space so that dimensions match. 9 The impact of those 
completion pseudo-posteriors on the quality of the Bayes factor approxima¬ 
tion is such that they cannot be suggested for a general usage in Bayes factor 
computation. 

A completely different route to the approximation of a marginal likelihood 


is Chib’s (1995), which happens to be a direct application of Bayes’ theorem: 
given x ~ fk(x\9 k ) and 9 k ~ 7r k(9k) (k = 1,2), we have that 


m k {x) = 


fk(x\9 ) 7Tfc(fl) 
TT k (d\x) 


for every value of 9 (since both the lhs and the rhs of this equation are 
constant in 9). Therefore, if an arbitrary value of 9, say 9 jf, is selected— 
e.g., the MAP estimate—and if a good approximation to TZk(9\y) can be 
constructed, denoted ir(9\y), Chib’s (1995) approximation to the marginal 
likelihood m k {x) is 
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fh k {x) = 


fk(yK)n k (0* k ) 
*k(0* k \y) 


( 10 ) 


As a first solution, Tt(0\y) may be a Gaussian approximation based on the 
MLE, but this is unlikely to be accurate in a general setting. Chib’s (1995) 


solution is to use a nonparametric approximation based on a preliminary 
Monte Carlo Markov chain (Section [4]) sample, even though the accuracy 
may also suffer in large dimensions. In the special setting of latent variables 
models (like the mixtures of distributions discussed in Example [6]), this ap¬ 
proximation is particularly attractive as there exists a natural approximation 
to 7Tfc(f%), based on the Rao-Blackwell (Gelfand and Smith, |T990] ) estimate 


Kk{0V\y) = ~ '52*k{9k\y>Zk) 


t= 1 


where the ’s are the latent variables simulated by the Monte Carlo Markov 
chain algorithm. The estimate T?k{0 k \y) is a parametric unbiased approxima¬ 
tion of TTfc (6*fc |z/) that converges to the true density with rate 0(y/T). This 
Rao-Blackwell approximation obviously requires the full conditional density 
n k(0 k \y,z) to be available in closed form (constant included) but this is the 
case for many standard models. Figure [7] reproduces an illustration from 
Marin and Robert (2007b) that compares the variability of Chib’s (1995) 
method against nearly optimal harmonic and importance sampling approx¬ 
imations in the setting of a probit posterior distribution. While the former 
solution varies more than the two latter solutions, its generic features make 
Chib’s (1995) method a reference for this problem. 



Fig. 7. In the setting of the probit modelling of R Pima Indian dataset, using 
three covariates and testing for the significance of the diabetes pedigree function, 
boxplots of 100 Chib’s, harmonic mean and importance estimates of Boi(y), based 
on simulations from the prior distributions, for 2 x 10 4 simulations ( Source: Marin 


and Robert 2007b). 
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While amenable to an importance sampling technique of sorts, the alter¬ 


native approach of nested sampling (Skilling 20061 for computing evidence 


is discussed in Chopin and Robert (2010) and Robert and Wraith (2009). 


Similarly, the specific approach based on the Savage-Dickey (Dickey, 1971 


Verdinelli and Wasserman 1995) representation of the Bayes factor associ¬ 


ated with the null hypothesis Hq : 0 = 6 o, as 


B 0 i(x) = 


ki(0q\x) 
7 T l(0o) 


can be related to the family of bridge sampling techniques, even though 
the initial theoretical foundations of the method are limited, as explained 


Robert and Marin (2009). This paper engineered a general framework 


that produces a converging and unbiased approximation of the Bayes factor, 


unrelated with the approach of Verdinelli and Wasserman (1995), that builds 


upon a mixed pseudo-posterior naturally derived from the priors under both 
the null and the alternative hypotheses. 

As can be seen from the previous developments, such methods require 
a rather careful tuning to be of any use. Therefore, they are rather diffi¬ 
cult to employ outside settings where pairs of models are opposed. In other 
words, they cannot be directly used in general model choice settings where the 
parameter space (and in particular the parameter dimension) varies across 
models, as in, for instance, Example [7] To address the computational issues 
corresponding to these cases requires more advanced techniques introduced 
in the next Section. 


4 Markov Chain Monte Carlo Methods 


As described precisely in Chapter II. 3 and in Robert and Casella (2004) 


MCMC methods try to overcome the limitation of regular Monte Carlo meth¬ 
ods through the use of a Markov chain with stationary distribution the pos¬ 
terior distribution. There exist rather generic ways of producing such chains, 
including Metropolis-Hastings and Gibbs algorithms. Besides the fact that 
stationarity of the target distribution is enough to justify a simulation method 
by Markov chain generation, the idea at the core of MCMC algorithms is that 
local exploration, when properly weighted, can lead to a valid representation 
of the distribution of interest, as for instance, when using the Metropolis- 
Hastings algorithm. 


4.1 Metropolis Hastings as universal simulator 


The Metropolis-Hastings, presented in Robert and Casella (2004) and Chap¬ 
ter II.3, offers a straightforward solution to the problem of simulating from 
the posterior distribution tt(0\x) oc f(x\0) tt(0): starting from an arbitrary 
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point 0o, the corresponding Markov chain explores the surface of this poste¬ 
rior distribution by a similarly arbitrary random walk proposal q(Q\9') that 
progressively visits the whole range of the possible values of 6. 


—Metropolis Hastings Algorithm— 


At iteration t 

1 Generate £ ~ <?(£|0^), u t ~ W([0,1]) 

2 Take 


0 (t+1 ) 



• f < 7r(6l^) g(^ (t) !6) 

* “ 7r(0(‘)|x) g(£ t |0W) 

otherwise 


Example 13 (Continuation of Example |1Q| ) . In the case p = 1, the probit 
model defined in Example [10] can also be over-parameterised as 


Pr (Yi = 1| Xi) = 1- Pr (Yi = 0|xj) = <2>(), 


since it only depends on /3/cr. The Bayesian processing of non-identified models 
poses no serious difficulty as long as the posterior distribution is well defined. 
This is the case for a proper prior like 

7 t(/3,ct 2 ) a cr -4 exp{—1 /ct 2 } exp{—/3 2 /50) 

that corresponds to a normal distribution on /3 and a gamma distribution on 
cr -2 . While the posterior distribution on (/3, cr) is not a standard distribution, it 
is available up to a normalising constant. Therefore, it can be directly processed 
via an MCMC algorithm. In this case, we chose a Gibbs sampler that simulates 
/3 and a 2 alternatively, from 

7r(/3|x,y, <r) a <P(xi/3/a) $(-XiP/a) x tt(/3) 

7/; = l Vi=0 


and _ 

7r ( cr2 |x, y, /?) oc ${xiP/<j) ${-XiP/<j) x 7r(cj 2 ) 

2/i = l !/i=0 

respectively. Since both of these conditional distributions are also non-standard, 
we replace the direct simulation by a one-dimensional Metropolis-Hastings step, 
using normal and log-normal jCAAQog.04) random walk propos¬ 

als, respectively. For a simulated dataset of 1,000 points, the contour plot of the 
log-posterior distribution is given in Figure [8] along with the last 1,000 points of 
a corresponding MCMC sample after 100,000 iterations. This graph shows a very 
satisfactory repartition of the simulated parameters over the likelihood surface, 
with higher concentrations near the largest posterior regions. For another simu¬ 
lation, Figure [ 9 ] details the first 500 steps, when started at (/3,cr 2 ) = (0.1,4.0). 
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Although each step contains both a (3 and a a proposal, some moves are ei¬ 
ther horizontal or vertical: this corresponds to cases when either the /? or the a 
proposals have been rejected. Note also the fairly rapid convergence to a modal 
zone of the posterior distribution in this case. 



Fig. 8. Contour plot of the log-posterior distribution for a probit sample of 1, 000 
observations, along with 1,000 points of an MCMC sample (Source: Robert and 
\Casella\ \20&4^- 





Fig. 9. First 500 steps of the Metropolis Hastings algorithm on the probit log- 
posterior surface, when started at (/?, a 2 ) = (0.1,4.0). 
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Obviously, this is only a toy example and more realistic probit models do 
not fare so well with down-the-shelf random walk Metropolis-Hastings algo¬ 


rithms, as shown for instance in Nobile (1998) (see also Robert and Casella 


20041 Section 1 0.3.2, |Marin and Robert} 2007a[ Section 4.3, and Marin and] 


Robert 


10 


The difficulty inherent to random walk Metropolis-Hastings algorithms 
is the scaling of the proposal distribution: it must be adapted to the shape 
of the target distribution so that, in a reasonable number of steps, the whole 
support of this distribution can be visited. If the scale of the proposal is too 
small, this will not happen as the algorithm stays “too local” and, if there 
are several modes on the posterior, the algorithm may get trapped within 
one modal region because it cannot reach other modal regions with jumps of 
too small a magnitude. The larger the dimension p is, the harder it is to set 
up the right scale, though, because 


(a) the curse of dimension implies that there are more and more empty re¬ 
gions in the space, that is, regions with zero posterior probability; 

(b) the knowledge and intuition about the modal regions get weaker and 
weaker; 

(c) the proper scaling involves a symmetric (p,p) matrix E in the proposal 
g{{9 — (9') 1 E(6 — 9')). Even when the matrix E is diagonal, it gets harder 
to scale as the dimension increases (unless one resorts to a Gibbs like 
implementation, where each direction is scaled separately). 


Note also that the on-line scaling of the algorithm against the empirical accep¬ 
tance rate is inherently flawed in that (a) the process is no longer Markovian 
and (b) the attraction of a modal region may give a false sense of convergence 
and lead to a choice of too small a scale, simply because other inodes will not 
be visited during the scaling experiment. 


4.2 Gibbs sampling and latent variable models 

The Gibbs sampler is a definitely attractive algorithm for Bayesian problems 
because it naturally fits the hierarchical structures so often found in such 
problems. “Natural” being a rather vague notion from a simulation point of 
view, it routinely happens that other algorithms fare better than the Gibbs 
sampler. Nonetheless, Gibbs sampler is often worth a try (possibly with other 
Metropolis-Hastings refinements at a later stage) in well-structured objects 
like Bayesian hierarchical models and more general graphical models. 

A very relevant illustration is made of latent variable models, where the 
observational model is itself defined as a mixture model, 

f(x\9)=[ f(x\z,6)g(z\9)dz. 

Such models were instrumental in promoting the Gibbs sampler in the sense 
that they have the potential to make Gibbs sampling sound natural very 
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easily. (See also Chapter II.3.) For instance, Tanner and Wong (1987) wrote 
a precursor article to Gelfand and Smith (1990) that designed specific two- 
stage Gibbs samplers for a variety of latent variable models. And many of 
the first applications of Gibbs sampling in the early 90’s were actually for 
models of that kind. The usual implementation of the Gibbs sampler in this 
case is to simulate the missing variables Z conditional on the parameters and 
reciprocally, as follows: 


—Latent Variable Gibbs Algorithm— 


At iteration t 


1 Generate z( t+1 ) ^ 

-g(z\9^,x) 

2 Generate 6^ t+1 '> - 

- 7r(6»|x,^ (t+1) ) 


While we could have used the probit case as an illustration (Example 10), 
as done in Chapter II.3, we choose to pick the case of mixtures (Example |6j) 
as a better setting. 


Example 14 (Continuation of Example [6]). The natural missing data 
structure of a mixture of distribution is historical In one of the first mixtures 
to be ever studied by Bertillon, in 1863, a bimodal structure on the height of 
conscripts in south eastern France (Doubs) can be explained by the mixing of 
two populations of military conscripts, one from the plains and one from the 
mountains (or hills). Therefore, in the analysis of data from distributions of the 
form 

k 

\0i), 

»=i 


a common missing data representation is to associate with each observation Xj 
a missing multinomial variable zj ~ A4k(l;pi, ■ ■ ■ ,Pk) such that Xj\zj = i ~ 
f(x \6i). In heterogeneous populations made of several homogeneous subgroups 
or subpopulations, it makes sense to interpret Zj as the index of the population 
of origin of Xj, which has been lost in the observational process. 

However, mixtures are also customarily used for density approximations, as a 
limited dimension proxy to non-parametric approaches. In such cases, the com¬ 
ponents of the mixture and even the number k of components in the mixture are 
often meaningless for the problem to be analysed. But this distinction between 
natural and artificial completion (by the Zj's) is lost to the MCMC sampler, 
whose goal is simply to provide a Markov chain that converges to the posterior 
as stationary distribution. Completion is thus, from a simulation point of view, a 
mean to generate such a chain. 


1994) 


The most standard Gibbs sampler for mixture models (Diebolt and Robert 
is thus based on the successive simulation of the zZ s and of the 6Vs, 


conditional on one another and on the data: 






















27 


1. Generate Zj\0, Xj (j = 1, ... ,n) 

2. Generate 0;|x, z (i = 1,... ,k) 

Given that the density / is most often from an exponential family, the simulation 
of the di's is generally straightforward. 

As an illustration, consider the (academic) case of a normal mixture with two 
components, with equal known variance and fixed weights, 

pN{pi, a 2 ) + (1 - p)Af(n 2 ,cr 2 ). (11) 

Assume in addition a normal A/”(0,10cr 2 ) prior on both means p i and p 2 . It is 
easy to see that p\ and p 2 are independent, given (z,x), and the respective 
conditional distributions are 


u 


E 


i / (-1 + n j ) > O ' 2 / (-1 + ; 


where rij denotes the number of Zi s equal to j. Even more easily, it comes that 
the conditional posterior distribution of z given (pi, p 2 ) is a product of binomials, 
with 


Pr (Zi = l\xi, Hi,/J, 2 ) 


pexp{-(xi - p 1 ) 2 /2a 2 } 


pexp{-(xi - px) 2 /2a 2 } + (1 -p) exp{—(xi - ^ 2 ) 2 /2cr 2 } ' 

Figure [TO] illustrates the behaviour of the Gibbs sampler in that setting, with a 
simulated dataset of 100 points from the ,7JV(0, l)+.3A/ r (2.7,1) distribution. The 
representation of the MCMC sample after 5,000 iterations is quite in agreement 
with the posterior surface, represented via a grid on the space and some 

contours. The sequence of consecutive steps represented on the left graph also 
shows that the mixing behaviour is satisfactory, since the jumps are scaled in 
terms of the modal region of the posterior. 

This experiment gives a wrong sense of safety, though, because it does not 
point out the fairly large dependence of the Gibbs sampler to the initial conditions, 


already signalled in Diebolt and Robert (1994) under the name of trapping states. 
Indeed, the conditioning of (/i 1; /.i 2 ) on z implies that the new simulations of the 
means will remain very close to the previous values, especially if there are many 
observations, and thus that the new allocations z will not differ much from 
the previous allocations. In other words, to see a significant modification of the 
allocations (and thus of the means) would require a very very large number 
of iterations. Figure [ll] illustrates this phenomenon for the same sample as in 
Figure [lO] for a wider scale: there always exists a second mode in the posterior 
distribution, which is much lower than the first mode located around (0,2.7). 
Nonetheless, a Gibbs sampler initialised close to the second and lower mode will 
not be able to leave the vicinity of this (irrelevant) mode, even after a large 
number of iterations. The reason is as given above: to jump to the other mode, 
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Fig. 10. Gibbs sample of 5, 000 points for the mixture posterior (left) and path 


the last 100 consecutive steps (right) against the posterior surface (Source: Robert 
| and CaseUa\ \200$ . 



Fig. 11. Posterior surface and corresponding Gibbs sample for the two mean 
mixture model, when initialised close to the second and lower mode, based on 
10,000 iterations (Source: Robert and Casella\ 200$. 


a majority of Zj's would need to change simultaneously and the probability of 
such a jump is too close to 0 to let the event occur. 11 The literature on this 
specific issue—of exploring all the modes of the posterior distribution—has grown 
around the denomination of "label switching problem" . For instance, Celeux et al. 


(2000) have pointed out the deficiency of most MCMC samplers in this respect, 

while 

Friihwirth-Schnatter ( 

2001 

2004 

2006), 

Berkhof et al. 

(2003) 

Jasra et al.| 

(2005 

), and 

Geweke 

(2007) have proposed different devices to overcome this 


difficulty, in particular when testing for the number of components. See Lee 


et al. (2009) for a survey on recent developments. 


This example illustrates quite convincingly that, while the completion is 
natural from a model point of view (since it is a part of the definition of the 
model), it does not necessarily transfer its utility for the simulation of the 
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posterior. Actually, when the missing variable model allows for a closed form 


likelihood, as is the case for mixtures, probit models (Examples 10 and 131 
and even hidden Markov models (see Cappe et al. 2005), the whole range of 


the MCMC technology can be used as well. The appeal of alternatives like 
random walk Metropolis-Hastings schemes is that they remain in a smaller 
dimension space, since they avoid the completion step(s), and that they are 
not restricted in the range of their moves. 12 

Example 15 (Continuation of Example |14[ ). Given that the likelihood of 
a sample (xi,..., x n ) from the mixture distribution (11) can be computed in 
0(2 n) time, a regular random walk Metropolis-Hastings algorithm can be used 
in this setup. Figure |T2| shows how quickly this algorithm escapes the attraction 
of the poor mode, as opposed to the Gibbs sampler of Figure [Tl] within a few 
iterations of the algorithm, the chain drifts over the poor mode and converges 
almost deterministically to the proper region of the posterior surface. The random 
walk is based on 0.04) proposals, although other scales would work as 

well but would require more iterations to reach the proper model regions. For 
instance, a scale of 0.005 in the Normal proposal above needs close to 5,000 
iterations to attain the main mode. 




Fig. 12. Track of a 1,000 iteration random walk Metropolis-Hastings chain on 
the posterior surface, the starting point being indicated by a cross. (The scale of 
the random walk is 0.2 .) 


The secret of a successful MCMC implementation in such latent variable 
models is to maintain the distinction between latency in models and latency 
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in simulation (the later being often called use of auxiliary variables). When 
latent variables can be used with adequate mixing of the resulting chain 
and when the likelihood cannot be computed in a closed form (as in hidden 


semi-Markov models, Cappe et al. 2004), a Gibbs sampler is a rather simple 


solution that is often easy to simulate from. Adding well-mixing random walk 
Metropolis-Hastings steps in the simulation scheme cannot hurt the overall 
mixing of the chain (Robert and Casella 2004 Chap. 13), especially when 


several scales can be used at once (see Section|5fl Note also that the technique 
of tempering that flattens the target distribution in order to facilitate its 
exploration by a Markov chain is available for most latent variable models, if 


sometimes in rudimentary versions. See Chopin (2007) and Marin and Robert 


(2007, Section 6.6) for an illustration in the setups of hidden Markov models 
and of mixtures (Example |15[ ), respectively. A final word is that the latent 
variable completion can be conducted in an infinity of ways and that several 
of these should be tried or used in parallel to increase the chances of success. 


4.3 Reversible jump algorithms for variable dimension models 


As described in Section [2731 model choice is computationally different from 
testing in that it considers at once a (much) wider range of models 9JI; and 
parameter spaces Oi. Although early approaches could only go through a 
pedestrian pairwise comparison, a more adequate perspective is to envision 
the model index p as part of the parameter to be estimated, as in ©>■ The 
(computational) difficulty is that we are then dealing with a possibly infinite 
space 13 that is the collection of unrelated sets: how can we then simulate 
from the corresponding distribution? 14 


The MCMC solution proposed by Green (1995) is called reversible jump 


MCMC , because it is based on a reversibility constraint on the transitions 
between the sets 0,;. In fact, the only real difficulty compared with previous 
developments is to validate moves (or jumps ) between the 0,’s, since propos¬ 
als restricted to a given 0* follow from the usual (fixed-dimensional) theory. 
Furthermore, reversibility can be processed at a local level: since the model 
indicator /i is a integer-valued random variable, we can impose reversibility 
for each pair (fci, k 2 ) of possible values of /i. The idea at the core of reversible 
jump MCMC is then to supplement each of the spaces Ok 1 and Ok 2 with 
adequate artificial spaces in order to create a bijection between them. For 
instance, if dim(0fc 1 ) > dim(0fc 2 ) and if the move from 0^ to Ok 2 can be 
represented by a deterministic transformation of 6^ kl > 

o {k2) =r kl ^ k2 (e^), 

Green (1995) imposes a dimension matching condition which is that the op¬ 
posite move from 0fc 2 to 0^ is concentrated on the curve 


: 0™ =T kl ^ k2 (0^)} . 
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In the general case, if 9 is completed by a simulation ~ gkA u ki) into 
{9^ kl \u kl ) and 9^ by u k2 ~ gk 2 ( u k 2 ) into {9^ k2 \u k2 ) so that the mapping 
between (#( fcl ),M fc J and ( 9^ , itfe 2 ) is a bijection, 

(9^ k2 \uk 2 ) = Tk 1 ^.k 2 (^ kl \ u k 1 )i ( 12 ) 


the probability of acceptance for the move from model 9 Ul kl to model 'DT k 2 is 
then 

. / 7r(fc 2 ,fl (fc2) ) 7T2i gk 2 (u k2 ) dT kl ^ k2 {9 (kl \u kl ) 
mm \n(k ll 9^'>) Tr 12 g kl ( u k 1 ) d(9^ kl \u kl ) 

involving 

- the Jacobian of the transform T kl ^ k2 „ 

- the probability 7 Tij of choosing a jump to M. kj while in A4 ki , and 

- < 7 ^, the density of u,. 



The acceptance probability for the reverse move is based on the inverse ratio 
if the move from 9 7l k2 to 971*^ also satisfies (121 with u k2 ~ gk 2 (u k2 )- 15 
The pseudo-code representation of Green’s algorithm is thus as follows: 


Green’s Algorithm 


At iteration t, if a;^ = (m,9^ m ^), 

1. Select model 9Jt„ with probability Tr mn 

2. Generate Umn rv *' g^mnigi) 

3. Set (0W,« nm ) = T m ^ n {9^ m \u mn ) 

4. Take = ( n,9 with probability 


7 r(n,fl (n) ) TTnmWnmiynm) 

, 7r(?7l, ^b n ) ) 7 ^mn^Pranig^mn) 

and take = a;^- 1 otherwise. 


dr m ^ ra (fl( m ), Mmn ) 

d(9(- m ),Umn) 


Even more than previous methods, the implementation of this algorithm 
requires a high degree of skillfulness in picking the right proposals and the 
appropriate scales. Indeed, it may be argued that the ultimate dependence 
of the method on the pairwise completion schemes prevents an extension 
of its use by a broader audience. This “art of reversible jump MCMC” is 
illustrated on the two following examples, extracted from Robert and Casella 
(2004, Section 14.2.3). 

Example 16 (Continuation of Example [6]). If we consider for model 971 a, 
the k component normal mixture distribution, 

k 

3=1 
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moves between models involve changing the number of components in the mix¬ 
ture and thus adding new components or removing older components or yet 
again changing several components. As in Richardson and Green ( |1997 1, we can 
restrict the moves when in model DJlk to only models 9Jlk+i and The 

simplest solution is to use a birth-and-death process: The birth step consists in 
adding a new normal component in the mixture generated from the prior and 
the death step is the opposite, removing one of the k components at random, in 
this case, the corresponding birth acceptance probability is 

. f Tt(k+i)k (k + 1)! ir k+1 (9 k+1 ) 


\^k(k+ 1) k\ ?Tfc {k T l)^/c(fc+l) (^fc(fc-f-l)) 

. /7T(fc+l)fc g(k + 1) l k+1 (6 k+1 ) (1 -pfc + i) fe_1 


,1 


1 


where t k denotes the likelihood of the k component mixture model 9Jt k and g(k) 
is the prior probability of model SUlfc. 16 

While this proposal can work well in some setting, as in Richardson and Green 
(1997) when the prior is calibrated against the data, it can also be inefficient, 
that is, leading to a high rejection rate, if the prior is vague, since the birth 
proposals are not tuned properly. A second proposal, central to the solution of 
Richardson and Green ( |1997 1, is to devise more local jumps between models, 
called split and combine moves, since a new component is created by splitting an 
existing component into two, under some moment preservation conditions, and 
the reverse move consists in combining two existing components into one, with 


symmetric constraints that ensure reversibility. (See, e.g., Robert and Casella 
2004 for details.) 


Figures 


illustrate the implementation of this algorithm for the so-called 


Galaxy dataset used by Richardson and Green (1997) (see also Roeder 


1992), 


which contains 82 observations on the speed of galaxies. On Figure [13j the 
MCMC output on the number of components k is represented as a histogram 
on k, and the corresponding sequence of k's. The prior used on k is a uniform 
distribution on {1,..., 20}: as shown by the lower plot, most values of k are 
explored by the reversible jump algorithm, but the upper bound does not appear 


14 


to be restrictive since the fcW’s hardly ever reach this upper limit. Figure 
illustrates the fact that conditioning the output on the most likely value of k (3 
here) is possible. The nine graphs in this Figure show the joint variation of the 
three types of parameters, as well as the stability of the Markov chain over the 
1,000,000 iterations: the cumulated averages are quite stable, almost from the 
start. 

The density plotted on top of the histogram in Figure [15] is another good 
illustration of the inferential possibilities offered by reversible jump algorithms, as 
a case of model averaging: this density is obtained as the average over iterations 
t of 


feW 

1=1 




jk’\Yjk , 
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which approximates the posterior expectation E[/(y|0)|x], where x denotes the 
data xi,... ,x S2 - 


Histogram 


of k 




Fig. 13. Histogram and raw plot of 100, 000 fc’s produced by a reversible jump 
MCMC algorithm for the Galaxy dataset. 


Example 17 (Continuation of Example [3]). For the AR(p ) model of 
Example [3] the best way to include the stationarity constraints is to use the 
lag-polynomial representation 

p 

I] (1 AiS) X t =e t) e t ~ AF(0, a 2 ), 

i—i 


of model 9 7t p , and to constrain the inverse roots, A*, to stay within the unit circle 
if complex and within [—1,1] if real (see, e.g. Robert 2007 Section 4.5.2). The 
associated uniform priors for the real and complex roots A,- is 


7Tp(A) 


1 

b/2J +1 


n 2 i ^ <i n 

AiCR Ai^R 


where \p/2\ + 1 is the number of different values of r p . This factor must be 
included within the posterior distribution when using reversible jump since it 
does not vanish in the acceptance probability of a move between models 9Jl p 
and 'Mq. Otherwise, this results in a modification of the prior probability of each 
model. 

Once again, a simple choice is to use a birth-and-death scheme where the 
birth moves either create a real or two conjugate complex roots. As in the birth- 
and-death proposal for Example [16] the acceptance probability simplifies quite 
dramatically since it is for instance 

min ( ( r p T !)• Lp/ 2] + 1 lp+i(flp+i) -A 

V^pCp+i) 0> ! Lb+ 1 )/ 2 J+ 1 4bp) ’ ) 
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Fig. 14. Reversible jump MCMC output on the parameters of the model M 3 for 
the Galaxy dataset, obtained by conditioning on k = 3. The left column gives the 
histogram of the weights, means, and variances; the middle column the scatterplot of 
the pairs weights-means, means-variances, and variances-weights; the right column 
plots the cumulated averages (over iterations) for the weights, means, and variances. 
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Fig. 16. Output of a reversible jump algorithm based on an AR( 3) simulated 
dataset of 530 points (upper left) with true parameters 9i (—0.1, 0.3, —0.4) and a = 
1. The first histogram is associated with k, the following histograms are associated 
with the Of s, for different values of k, and of a 2 . The final graph is a scatterplot of 
the complex roots (for iterations where there were complex roots). The one before 
last graph plots the evolution over the iterations of 0 1 , 62 , #3 (Source: Robert 2003). 


in the case of a move from DJl p to 9Jl p+1 . (As for the above mixture example, 
the factorials are related to the possible choices of the created and the deleted 
roots.) 

Figure [16] presents some views of the corresponding reversible jump MCMC 
algorithm. Besides the ability of the algorithm to explore a range of values of k, 
it also shows that Bayesian inference using these tools is much richer, since it 
can, for instance, condition on or average over the order k, mix the parameters 
of different models and run various tests on these parameters. A last remark 
on this graph is that both the order and the value of the parameters are well 
estimated, with a characteristic trimodality on the histograms of the Of s, even 
when conditioning on k different from 3, the value used for the simulation. 

In conclusion, while reversible jump MCMC is a generic and powerful 
method for handling model comparison when faced with a multitude of mod¬ 
els, its sensitivity to the tuning “parameters” that determine the pairwise 
jumps makes us favour the alternative of computing Bayes factors. When the 
number of models under comparisons is sufficiently small to allow for the 
computation of all marginal likelihoods in parallel, this computational solu¬ 
tion is more efficient. It is indeed quite rare that the structure of the posterior 
under a model SEJli provides information about the structure of the posterior 
under another model SDI 2 , unless both models are quite close. Intrinsically, 
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reversible jump MCMC is a random walk over the collection of models and, 
as such, looses in efficiency in its exploration of this collection. (In particu¬ 
lar, it almost never makes sense to implement reversible jump MCMC when 
comparing a small number of models.) 


5 More Monte Carlo Methods 

While MCMC algorithms considerably expanded the range of applications 
of Bayesian analysis, they are not, by any means, the end of the story! Fur¬ 
ther developments are taking place, either at the fringe of the MCMC realm 
or far away from it. We indicate below a few of the directions in Bayesian 
computational Statistics, omitting many more that also are of interest... 


5.1 Adaptivity for MCMC algorithms 


Given the range of situations where MCMC applies, it is unrealistic to hope 
for a generic MCMC sampler that would function in every possible set¬ 
ting. The more generic proposals like random-walk Metropolis-Hastings al¬ 
gorithms are known to fail in large dimension and disconnected supports, 


because they take too long to explore the space of interest (Neal 2003). The 


reason for this impossibility theorem is that, in realistic problems, the com¬ 
plexity of the distribution to simulation is the very reason why MCMC is 
used! So it is difficult to ask for a prior opinion about this distribution, its 
support or the parameters of the proposal distribution used in the MCMC 
algorithm: intuition is close to void in most of these problems. 

However, the performances of off-the-shelve algorithms like the random- 
walk Metropolis- Hastings scheme bring information about the distribution 
of interest and, as such, should be incorporated in the design of better and 
more powerful algorithms. The problem is that we usually miss the time to 
train the algorithm on these previous performances and are looking for the 
Holy Grail of automated MCMC procedures! While it is natural to think that 
the information brought by the first steps of an MCMC algorithm should be 
used in later steps, there is a severe catch: using the whole past of the “chain” 
implies that this is not a Markov chain any longer. Therefore, usual conver¬ 
gence theorems do not apply and the validity of the corresponding algorithms 
is questionable. Further, it may be that, in practice, such algorithms do de¬ 
generate to point masses because of a too rapid decrease in the variation of 
their proposal. 

Example 18 (Continuation of Example[8]). For the ^-distribution sample, 
we could fit a normal proposal from the empirical mean and variance of the 
previous values of the chain, 




= 7^^ and ^ = 
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This leads to a Metropolis-Hastings algorithm with acceptance probability 

' 1 / + (Xj - dW) 2 r (v+1)/2 exp-(/i t - 9^) 2 /2a 2 


n 

3= 2 


’ + 0 3 ~ 0 2 


exp-(Mt -f) 2 j2a\ 


where £ is the proposed value from Af(fL t ,a 2 ). The invalidity of this scheme 
(because of the dependence on the whole sequence of #W's till iteration t) is 


illustrated in Figure 17 when the range of the initial values is too small, the 
sequence of 0M's cannot converge to the target distribution and concentrates 
on too small a support. But the problem is deeper, because even when the range 
of the simulated values is correct, the (long-term) dependence on past values 
modifies the distribution of the sequence. Figure fl8] shows that, for an initial 
variance of 2.5, there is a bias in the histogram, even after 25,000 iterations and 
stabilisation of the empirical mean and variance. 



Fig. 17. Output of the adaptive scheme for the t-distribution posterior with a 
sample of 10 Xj ~ 75 and initial variances of (top) 0.1, (middle) 0.5, and (bottom) 
2.5. The left column plots the sequence of 6 ^’s while the right column compares 
its histogram against the true posterior distribution (with a different scale for the 
upper graph). 
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Fig. 18. Comparison of the distribution of an adaptive scheme sample of 25, 000 
points with initial variance of 2.5 and of the target distribution. 


Even though the Markov chain is converging in distribution to the target 
distribution (when using a proper, i.e. time-homogeneous updating scheme), 
using past simulations to create a non-parametric approximation to the tar¬ 
get distribution does not work either. Figure [19| shows for instance the output 
of an adaptive scheme in the setting of Example [lS] when the proposal dis¬ 
tribution is the Gaussian kernel based on earlier simulations. A very large 
number of iterations is not sufficient to reach an acceptable approximation 
of the target distribution. 



Fig. 19. Sample produced by 50, 000 iterations of a nonparametric adaptive 
MCMC scheme and comparison of its distribution with the target distribution. 


The overall message is thus that, without further guidance, one should not 
constantly adapt the proposal distribution on the past performances of the 
simulated chain. Either the adaptation must cease after a period of burnin 
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(not to be taken into account for the computations of expectations and quan¬ 
tities related to the target distribution). Else, the adaptive scheme must be 
theoretically assess on its own right. This later path is not easy and only a few 


examples can be found (so far) in the literature. See, e.g., Gilks et al. (19981 


who use regeneration to create block independence and preserve Markovianity 


on the paths rather than on the values, Haario et al. (1999 2001) who derive 


a proper adaptation scheme in the spirit of Example 18] by using a ridge-like 
correction to the empirical variance, and Andrieu and Robert (2001) who 


propose a more general framework of valid adaptivity based on stochastic 
optimisation and the Robbin-Monro algorithm. A more generic perspective 


is found in Roberts and Rosenthal (2009), who tune the random walk scale in 


each direction of the parameter space toward an optimal acceptance rate of 


0.44, a rate suggested in Gelman et al. (1996). Roberts and Rosenthal (20091 


provide validated constraints on the tuning scheme to the extent of offering 


an R package called amcmc and described in Rosenthal (2007). More precisely. 


for each component of the simulated parameter, a factor Si corresponding to 
the logarithm of the random walk standard deviation is updated every 50 
iterations by adding or subtracting a factor e t depending on whether or not 
the average acceptance rate on that batch of 50 iterations and for this com¬ 
ponent was above or below 0.44. If e t decreases to zero as min(.01,1 /Vt), 
the conditions for convergence are satisfied. (See Robert and Casella (2009, 
Section 8.5.2, for more details.) 


5.2 Population Monte Carlo 

To reach acceptable adaptive algorithms, while avoiding an extended study 
of their theoretical properties, a better alternative is to leave the setup of 
Markov chain simulations and to consider instead sequential or population 


Monte Carlo methods (Iba 2000 Cappe et al. 2004) that have much more 


in common with importance sampling than with MCMC. They are inspired 
from particle systems that were introduced to handle rapidly changing tar¬ 


get distributions like those found in signal processing and imaging (Gordon 


et al.[ |1993| |Shephard and Pitt] |1997] |Doucet et al.] |2001[ |Del Moral et al. 


2006) but primarily handle fixed but complex target distributions by building 


a sequence of increasingly better proposal distributions. 17 Each iteration of 
the population Monte Carlo (PMC) algorithm thus produces a sample ap¬ 
proximately simulated from the target distribution but the iterative structure 
allows for adaptivity toward the target distribution. Since the validation is 
based on importance sampling principles, dependence on the past samples 
can be arbitrary and the approximation to the target is valid (unbiased) at 
each iteration and does not require convergence times nor stopping rules. 

If t indexes the iteration and i the sample point, consider proposal distri¬ 
butions qn that simulate the xf^'s and associate to each xf' 1 an importance 
weight 
























































40 


0i ] =n(Xi t) )/qit(Xi ) ), i = 1,... ,n. 

(The proposal distribution thus depends both on the iteration and on the 
particle index.) Approximations of the form 


3t 


1 

n 


<■(*!”) 


are then unbiased estimators of E ir [/i(X)], even when the importance distri¬ 
bution q it depends on the entire past of the experiment. Indeed, if £ denotes 
the vector of past random variates that contribute to qu, and g(() its arbitrary 
distribution, we have 


// s ®'* w, “ <I,dI! ' (0d< = // 

Furthermore, assuming that the variances 


/i(a;)7r(a;)dxg((C)dC = E 7r [/i(A')]. 


var 



exist for every 1 < i < n, we have 


1 n 

var Pt) = -^ 5I var (efPp! 0 )) 
1=1 


due to the cancelling effect of the weights £>■* . 

Since, usually, the density 7r is unsealed, we use instead 


« K 


i = 1,... ,n, 


scaled so that the of 1 ~'s sum up to 1. In this case, the unbiasedness is lost, 
although it approximately holds. In fact, the estimation of the normalising 
constant of 7r improves with each iteration t, since the overall average 


Wt 


t n 




(r)\ 


i=l %r(^ T) ) 


is convergent. Therefore, as t increases, w t contributes less and less to the 
variability of . 

However, Douc_etjiL _(j2007a) have shown that this use of the importance 
weights of 1 in Cappe et al. (2004) was not adaptive enough and they proposed 
a Rao-Blackwellised substitute 1 ® 


(t) 

Si 


r 0h (t) ) 


E 


Oh 


i = 1, •••,«, 
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that guaranteed an (asymptotic in n) improvement of the proposal at each 

~ 2007a|b 


iteration t, under specific forms of the g it ’s (Douc et al. 


Cappe et al. 


20081 


Since the above establishes that an simulation scheme based on sample 
dependent proposals is fundamentally a specific kind of importance sampling, 
the following algorithm is validated by the same principles as regular impor¬ 
tance sampling: 


—Population Monte Carlo Algorithm— 

For t = 1,... ,T 

1. For i = 1,.. ., n, 

i) Select the generating distribution qu(-) 

ii) Generate x[^ ~ qu{x) 

iii) Compute g[ t] = n{xf ’)/ 

2. Normalise the g^'s to sum up to 1 

3. Resample n values from the x[^’s with replacement, using the 
weights g^\ to create the sample (x±\ ... ,x^) 


Step (i) is singled out because it is the central property of the PMC 
algorithm, namely that adaptivity can be extended to the individual level 
and that the can be picked based on the performances of the previous 
qi(t- i)’s or even on all the previously simulated samples, if storage allows. 
For instance, the qa's can include large tails proposals as in the defensive 
sampling strategy of |Hesterber g (1995), to ensure finite variance. Similarly, 
Warnes’ (2001) non-parame tric Gaussian kernel approxima t ion c an be used 
as a proposal. 19 (See also in Stavropoulos and Titterington 2001 the smooth 


bootstrap technique as an earlier example of PMC algorithm.) 

A major difference between the PMC algorithm and earlier proposals in 
the particle system literature is that past dependent moves as those of Gilks 
and Berzuini (2001) and Del Moral et al. (2006) remain within the MCMC 


framework, with Markov transition kernels with stationary distribution equal 
to 7r. 

Example 19 (Continuation of Example |14[ ). We consider here the imple¬ 
mentation of the PMC algorithm in the case of the the normal mixture (jTT|). 


As in Example 15 a PMC sampler can be efficiently implemented without the 
(Gibbs) augmentation step, using normal random walk proposals based on the 
previous sample of fi = (p. 1 ,/r 2 )'s. Moreover, the difficulty inherent to random 
walks, namely the selection of a “proper” scale, can be bypassed because of the 
adaptivity of the PMC algorithm. Indeed, the proposals can be associated with 
a range of variances Vk (1 < k < K) ranging from, e.g., 10 3 down to 10~ 3 . At 
each step of the algorithm, the new variances can be selected proportionally to 
the performances of the scales i>k on the previous iterations. For instance, a scale 
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can be chosen proportionally to its non-degeneracy rate in the previous iteration, 
that is, the percentage of points generated with the scale Vk that survived after 
resampling. 20 The weights are then of the form 


/ 



n { 



E“„ v 

jMi) 



)* 


to#" 

_1) ,v k , 

, 


Qi oc 


where <p(q\s, v ) is the density of the normal distribution with mean sand variance 
v at the point q. 

Compared with an MCMC algorithm in the same setting (see Examples 14 
and 151, the main feature of this algorithm is its ability to deal with multiscale 

n produces the 


proposals in an unsupervised manner. The upper row of Figure^ 
frequencies of the five variances Vk used in the proposals along iterations: The 
two largest variances Vk most often have a zero survival rate, but sometimes 
experience bursts of survival. In fact, too large a variance mostly produces points 
that are irrelevant for the posterior distribution, but once in a while a point 


6 


it) 


the 


gets close to one of the modes of the posterior. When this occurs, 

corresponding Qj is large and Op' is thus heavily resampled. The upper right graph 
shows that the other proposals are rather evenly sampled along iterations. The 
influence of the variation in the proposals on the estimation of the means /i! and 
can be seen on the middle and lower panels of Figure [22] First, the cumulative 
averages quickly stabilise over iterations, by virtue of the general importance 
sampling proposal. Second, the corresponding variances take longer to stabilise 
but this is to be expected, given the regular reappearance of subsamples with 
large variances. 

In comparison with Figures and [12] Figure [20] shows that the sample 
produced by the PMC algorithm is quite in agreement with the modal zone of 
the posterior distribution. The second mode, which is much lower, is not preserved 
in the sample after the first iteration. Figure [2l] also shows that the weights are 
quite similar, with no overwhelming weight in the sample. 

The generality in the choice of the proposal distributions qu is obviously 
due to the abandonment of the MCMC framework. The difference with an 
MCMC framework is not simply a theoretical advantage: as seen in Section 
H3 proposals based on the whole past of the chain do not often work. Even 
algorithms validated by MCMC steps may have difficulties: in one example of 


Cappe et al. (2004), a Metropolis-Hastings scheme does not work well, while 


a PMC algorithm based on the same proposal produces correct answers. 
Population Monte Carlo algorithms offer a theoretically valid solution to 
the adaptivity issue, with practical gains as well, as exemplified in Wraith| 


et al. (2009). The connection with nonparametric Bayes estimation has not 


yet been sufficiently pursued but the convergence of kernel-like proposals 


demonstrated by Cappe et al. (2008) shows this is a promising direction. 
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Fig. 20. Representation of the log-posterior distribution with the PMC weighted 
sample after 30 iterations (the weights are proportional to the circles at each point) 


(Source: Cappe et al. 2004)■ 






Fig. 21. Histograms of the PMC sample: sample at iteration 5 (left) before resam¬ 
pling and (right) after resampling. 
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Fig. 22. Performances of the mixture PMC algorithm for 1000 observations from 
a 0.2A/"(0,1) + 0.8A/”(2,1) distribution, with 0 = 1 A = 0.1, Vk = 5, 2, .1, .05, .01, 
and a population of 1050 particles: (upper left) Number of resampled points for 
the variances v\ = 5 (darker) and Vi = 2; (upper right) Number of resampled 
points for the other variances, V 3 = 0.1 is the darkest one; (middle left) Variance 
of the simulated pi's along iterations; (middle right) Cumulated average of the 
simulated pi's over iterations; (lower left) Variance of the simulated p 2 's along 
iterations; (lower right) Cumulated average of the simulated p 2 's over iterations 
(Source: Cappe et al.\ 2004). 


5.3 Approximate Bayesian computation 


One particular application of the Accept-Reject algorithm that has found a 
niche of its own in population genetics is called approximate Bayesian com¬ 
putation (ABC), following the denomination proposed by Pritchard et al. 
(1999). It is in fact an appealing substitute for “exact” (meaning convergent) 
Bayesian algorithms in settings where the likelihood function f(x\9) is not 
available, even up to a normalising constant, but where it can easily be simu¬ 
lated. The range of applications is thus much wider than population genetics 
and covers for instance graphical models and inverse problems. 
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Assuming, thus, that a posterior distribution 7 r( 0 |xo) oc ir(0)f(xo\0) is to 
be simulated, a rudimentary accept-reject algorithm generates values from 
the prior and from the likelihood until the simulated observation is equal to 
the original observation Xq- 
Repeat 

Generate 9 ~ w(0) and X ~ f{x\0) 


until X = Xq 

Since the conditional probability of acceptance is f(xo\8), the distribution of 
the accepted 6 is truly tt(0\xq), even when f(x\8) cannot be computed. 

The above algorithm is valid, but it is unfortunately restricted to settings 
where (a) n(0) is a proper prior and (b) Pr g(X = Xq) has a positive proba¬ 
bility of occurrence. Even in population genetics where the outcome X is a 
discrete random variable, the size of the state-space is often such that this 
algorithm cannot be implemented. The proposal of Pritchard et al. (1999) 
is to replace the exact acceptance condition X = Xq in the above with an 
approximate condition d(X, xq) < e, where d is a distance and e is a tolerance 
level. The corresponding ABC algorithm is thus: 


—Approximate Bayesian computation Algorithm— 

For i = 1,..., n, 

1 Generate 0j ~ 7 r(0) 

2 Generate Xi ~ f(x\0i) and compute d(xo,Xi) 

Accept the 0j's such that d(xi,x) < e. 


The output of the ABC algorithm is distributed from the distribution 
with density proportional to n(8) Prg{d(X, xo) < e}, where Prg represents 
the sampling distribution of X , indexed by 6. This density is somehow mis¬ 
takenly denoted by 7 t{0 \ d(x,x o) < e}, where the conditioning corresponds 
to the marginal distribution of d(x,x o) given xq. While unavoidable, this 
inherent approximation step makes the ABC method difficult to assess and 
to compare with regular Monte Carlo approaches, even though recent works 
replace it within a non-parametric framework that aims at approximating 
the conditional density function f(x\6) and hence envision e as a potential 
bandwidth (Beaumont et al. 2002| Blum and Frangois 20101. 

Improvements on the original scheme have been achieved either by mod¬ 
ifying the proposal distribution of the parameter 0, in order to increase the 


density of x’s within the vicinity of y (Marjoram et al. 2003 Bortot et al. 


2007), or, as stated above, by envisioning the problem as a conditional density 


estimation issue and by developing techniques to allow for larger e (Beaumont 


et al. 2002). For instance, Marjoram et al. (2003) defined a Markov chain 


Monte Carlo (MCMC) version of the ABC algorithm that enjoys the same 
validity as the original, namely that, if a Markov chain (8^) is created via 
the transition function 
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9' ~ K(9' | 0W) if x ~ f(x | 0') is such that a; = xq 


5»Ci+!) = 


0 (‘) 


and u ~ W(0,1) < 
otherwise, 


tt(0')K(QW | 0 ') 

7T (eW)K(e' 1 0W) ’ 


its stationary distribution is the posterior 7r(0 | x 0 ). Again, in most settings, 
exact equality is not achievable and the above constraint x = Xq is replaced 


with the approximation d(x, Xq) < e. In Beaumont et al. (2009), a population 
Monte Carlo modification of ABC is introduced, resulting into the algorithm: 


—PMC-ABC Algorithm— 


Given a decreasing sequence of tolerance thresholds ei > ... > et, 

1. At iteration t = 1, 

For i = 1 ,N 

Simulate 9^ ~ 7r(0) and x ~ f{x \ 9 k) until g{x,y) < e\ 
Set uj 1J = 1/A 

Take r 2 as twice the empirical variance of the 0 ? - 1 ’s 

2. At iteration 2 < t < T, 


For i = 1, N, repeat 

Pick 9 * from the 0^ t_1 ' l ’s with probabilities uk -1 "* 


generate 0^ | 0* ~ A/’(0*,r i 2 ) and x ~ /(x \ 9f’) 


(*)n 


until p(x,y) < Ei 


Set oc 7r(0f-’)/Ef=i 


(*)> 




k‘( 


0 „ (t) -1 


/*-i) 
3 


)} 


Take r 2 +1 


as twice the weighted empirical variance of the 


r\(t) , 


From a practical viewpoint, the number of iterations T can be controlled 
via the modifications in the parameters of K t , a stopping rule being that the 
iterations should stop when those parameters have settled, while the more 
fundamental issue of selecting a sequence of e t ’s towards a proper approxi¬ 
mation of the true posterior can rely on the stabilisation of the estimators of 
some quantities of interest associated with this posterior. But there is gener¬ 
ally no fixed-cost solution that let et decrease to zero with t. Note also that 
the SMC algorithm of Del Moral et al. (2006) has been recently extended to 
this ABC setup, making the resulting algorithm a direct competitor of the 
above PMC-ABC algorithm. 


6 Conclusion 

This short overview of the problems and solutions considered for Bayesian 
Statistics is nothing but an introduction to the game: there are much more 
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complex problems than those illustrated above and much more advanced 
techniques than those presented in these pages. The reader is then encouraged 
to enter the literature on the topic, maybe with other introductory surveys 
like Cappe and Robert (|2000 ) and Andrieu et al. (20041, but mostly through 


books like Chen et al. (2000 , Doucet et al. (2001), Liu (20011 


Casella 

(2004), 

Albert 

(2007), 

Casella 

(2009). 


Marin and Robert (2007aI, and 


Robert and 


Robert and 


We have not mentioned so far entries to Bayesian softwares like winBUGS, 


developed by the MRC Unit in Cambridge (Gilks et al. 1994 Spiegelhalter 


et al. 1999), Ox (Doornik et al. 2002), BATS (Pole et al. 19941, BACC 


(Geweke 1999) and the Minitab package of Albert (1996), which all cover 


some aspects of Bayesian computing. Obviously, these packages require some 
expertise from the user and are thus more difficult of use than the classi¬ 
cal open source or commercial softwares like R, Splus, Statgraphics, StatXact, 
SPSS or SAS. In other words, they are not black boxes that could be used 
by laymen with no statistical background. But this entrance fee to the use of 
Bayesian softwares is inevitable, given the versatile nature of Bayesian analy¬ 
sis: since it offers much more variability than standard inferential procedures, 
through the choice of prior distributions and loss functions for instance, it 
also requires more input from the user! And, once these preliminary steps 
have been overcome, the programming involved in a software like winBUGS is 
rather limited and certainly not harder than writing a code in R or Matlab. 21 

As stressed in this Chapter, computational issues are central to the de¬ 
sign and implementation of Bayesian analysis. The new era opened by the 
MCMC methodology has brought much more freedom in the use of Bayesian 
methods, as reflected by the increase of Bayesian studies in applied Statistics. 
As usually the case, a strong increase in the use of a methodology also sees a 
corresponding increase in its misuse! Inconsistent data-dependent priors and 
improper posteriors are sometimes appearing in studies and, more generally, 
the assessment of prior modelling (or even of MCMC convergence) are rarely 
conducted with sufficient care. This is somehow a price to pay for the wider 
range of Bayesian studies, while the improvement of corresponding software 
should bring more guidelines and warnings about these misuses of Bayesian 
analysis. 
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Notes 


^Tn this chapter, the denomination universal is used in the sense of uni¬ 
formly over all distributions. 

2 To impose the stationarity constraint when the order of the AR(p) model 
varies, it is necessary to reparameterise this model in terms of either the 
partial autocorrelations or of the roots of the associated lag polynomial. (See, 
e.g. 


Robert 2007 Section 4.5.) 


d In this presentation of Bayes factors, we completely bypass the method¬ 
ological difficulty of defining 7r {8 £ 6>o) when 6>o is of measure 0 for the 
original prior tt and refer the reader to Robert (2007, Section 5.2.3) and 
Marin and Robert (2007, Section 2.3.2) for proper coverage of this issue. 

4 The prior distribution can be used for importance sampling only if it is 
a proper prior and not a cr-finite measure. 

5 The constant order of the Monte Carlo error does not imply that the 
computational effort remains the same as the dimension increases, most ob¬ 
viously, but rather that the decrease (with m) in variation has the rate 1/i/ro. 

6 The empirical (Monte Carlo) confidence interval is not to be confused 
with the asymptotic confidence interval derived from the normal approxi¬ 
mation. As discussed in Robert and Casella (2004, Chapter 4), these two 
intervals may differ considerably in width, with the interval derived from the 
CLT being much more optimistic! 

7 An alternative to the simulation from one T(v,Xi, 1) distribution that 
does not require an extensive study on the most appropriate Xi is to use a 
mixture of the T{y, Xi, 1) distributions. As seen in Section 5.2 the weights of 
this mixture can even be optimised automatically. 

8 We stress the point that this is mostly an academic exercise as, in regular 
settings, it is rarely the case that independent components are used for the 
importance function. 

9 Section 4.3 covers in greater details the setting of varying dimension 
problems, with the same theme that completion distributions and parameters 
are necessary but influential for the performances of the approximation. 

10 Even in the simple case of the probit model, MCMC algorithms do not 
always converge very quickly, as shown in Robert and Casella (2004, Chapter 
14). 


11 It is quite interesting to see that the mixture Gibbs sampler suffers from 
the same pathology as the EM algorithm, although this is not surprising given 
that it is based on the same completion scheme. 

12 This wealth of possible alternatives to the completion Gibbs sampler is a 
mixed blessing in that their range, for instance the scale of the random walk 
proposals, needs to be scaled properly to avoid inefficiencies. 








49 


13 The difficulty with the infinite part of the problem is easily solved in that 
the setting is identical to simulation problems in (countable or uncountable) 
infinite spaces. When running simulations in those spaces, some values are 
never visited by the simulated Markov chain and the chances a value is visited 
is related with the probability of this value under the target distribution. 

14 Early proposals to solve the varying dimension problem involved satu¬ 
ration schemes where all the parameters for all models were updated deter¬ 
ministically (Carlin and Chib 1995), but they do not apply for an infinite 


collection of models and they need to be precisely calibrated to achieve a 
sufficient amount of moves between models. 

15 For a simple proof that the acceptance probability guarantees that the 
stationary distribution is Tr(k 1 8 l ' k ' > ), see Robert and Casella (2004, Section 

11 . 2 . 2 ). 

16 In the birth acceptance probability, the factorials fc! and (k 1)! appear 
as the numbers of ways of ordering the k and k + 1 components of the mix¬ 
tures. The ratio cancels with l/(k + 1), which is the probability of selecting 
a particular component for the death step. 

17 The “sequential” denomination in the sequential Monte Carlo methods 
thus refers to the algorithmic part, not to the statistical part. 

18 The generic Rao-Blackwellised improvement was introduced in the orig¬ 


inal MCMC paper of Gelfand and Smith (1990) and studied by Liu et al. 


(1994) and Casella and Robert (1996). More recent developments are pro¬ 


posed in Cornuet et al. (2009), in connection with adaptive algorithms like 
PMC. 

19 Using a Gaussian non-parametric kernel estimator amounts to (a) sam¬ 
pling from the xf* ’s with equal weights and (b) using a normal random walk 
move from the selected xf\ with standard deviation equal to the bandwidth 
of the kernel. 

20 When the survival rate of a proposal distribution is null, in order to 
avoid the complete removal of a given scale Vk, the corresponding number Vk 
of proposals with that scale is set to a positive value, like 1% of the sample 
size. 

21 An R package called +mcsm+has been developed in association with 


Robert and Casella (2009) for training about Monte Carlo methods. 
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